Building a multilingual parallel corpus for human users
Authors
Abstract
We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora, we give an overview of the data collection procedure, covering text selection criteria, data format, conversion, alignment, lemmatization and tagging. Finally, we discuss the challenges and prospects of the project.
Similar resources
Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...
Building the multilingual TUT parallel treebank
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French annotated in the pure dependency format of the Turin University Treebank, i.e. Parallel–TUT. We hypothesize that the major features of this annotation format can be of some help in addressing the typical issues related to parallel corpora, e.g. alignment at various levels. Therefor...
Browsing Multilingual Information with the MultiSemCor Web Interface
Parallel and comparable corpora represent a crucial resource for different Natural Language Processing tasks like machine translation, lexical acquisition, and knowledge structuring but are also suitable to be consulted by humans for different purposes, such as linguistic teaching, corpus linguistics, translation studies, lexicography, and multilingual information browsing. To enhance their exploit...
Multilingual Corpora - Current Practice and Future Trends
In this paper I would like to give an overview of multilingual corpus building to date. In doing so, I will review two types of multilingual corpus, parallel and translation corpora. Following this, I will consider what tools are currently available which allow for the exploitation of such corpora in the context of machine/machine aided translation. Throughout I will give a fairly global view o...
Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications
Parallel corpora encode extremely valuable linguistic knowledge, the revealing of which is facilitated by the recent advances in multilingual corpus linguistics. The linguistic decisions made by the human translators in order to faithfully convey the meaning of the source text can be traced and used as evidence on linguistic facts which, in a monolingual context, might be unavailable to (or ove...